CLaRK - an XML-based System for Corpora Development

Authors

Kiril Simov

Alexander Simov

Milen Kouylekov

Krassimira Ivanova

Abstract

In this paper we describe the architecture and the intended applications of the CLaRK System. The development of the CLaRK System started under the Tübingen-Sofia International Graduate Programme in Computational Linguistics and Represented Knowledge (CLaRK). The main aim behind the design of the system is to minimize human intervention during the creation of corpora. Creating corpora is still an important task for the majority of languages, such as Bulgarian, where the effort invested in such development is very modest in comparison with more intensively studied languages like English, German and French. We consider the corpora creation task to be the editing, manipulation, searching and transforming of documents. Some of these tasks are done for a single document or a set of documents, others on a part of a document. Besides the efficiency of the corresponding processing in each state of the work, the most important investment is human labor. Thus, in our view, the design of the system has to be directed towards minimizing human work. For document management, storage and querying we chose XML technology because of its popularity and ease of understanding. Very soon XML technology will be a part of our lives and it will be the predominant language for data description and exchange on the Internet. Moreover, many already developed standards for corpus description, like [XCES, 2001] and [TEI, 2001], are already adapted to the XML requirements. The core of the CLaRK System is an XML editor, which is the main interface to the system. With the help of the editor the user can create, edit or browse XML documents. To facilitate corpus management, we enlarge the XML inventory with facilities that support linguistic work.
We added the following basic language processing modules: a tokenizer with a module that supports a hierarchy of token types; a finite-state engine that supports the writing of cascaded finite-state grammars and facilities that search for finite-state patterns; the XPath query language, which is able to support navigation over the whole set of mark-up of a document; and mechanisms for imposing constraints over XML documents which are applicable in the context of some events. We envisage several uses for our system:

Similar resources

The CLaRK System Tools XML-based Corpora Development

CLaRK is an XML-based software system for corpora development. It incorporates several technologies: XML technology; Unicode; Regular Cascaded Grammars; Constraints over XML Documents. The basic components of the system are: a tagger, a concordancer, an extractor, a grammar processor, a constraint engine.

Development of Corpora within the CLaRK System: The BulTreeBank Project Experience

The CLaRK System: XML-based Corpora Development System for Rapid Prototyping

The paper presents the CLaRK System as a tool for the creation of XML-based corpora and a platform for rapid prototyping. The system provides a set of basic tools for processing XML documents. These tools include: tokenizers, regular grammars, constraints; remove, insert, extract, sort, transformation operations. Additionally, the system is equipped with a macro language which allows the creati...

Cascaded Regular Grammars over XML Documents

The basic mechanism of CLaRK for linguistic processing of text corpora is the cascade regular grammar processor. The main challenge to the grammars in question is how to apply them on XML encoding of the linguistic information. The system offers a solution using an XPath language for constructing the input word to the grammar and an XML encoding of the categories of the recognized words.
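The idea in this abstract can be illustrated with a minimal Python sketch. It is not CLaRK's implementation or grammar format. It assumes a hypothetical markup in which each token is a `<w>` element carrying a one-character `pos` attribute, uses the limited XPath subset of the standard `xml.etree.ElementTree` module to select sentences, and uses ordinary regular expressions in place of CLaRK's grammars. The names `CATEGORY_SYMBOLS`, `symbol`, `apply_rule` and `run_cascade` are invented for the illustration.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: one-character symbols stand for categories recognized by an
# earlier grammar, so that a later grammar in the cascade can refer to them.
CATEGORY_SYMBOLS = {"NP": "n", "PP": "p"}


def symbol(elem):
    """Map one XML element to a single character of the input word."""
    if elem.tag == "w":
        # Assumption: the 'pos' attribute holds a one-character tag.
        return elem.get("pos", "?")[:1] or "?"
    return CATEGORY_SYMBOLS.get(elem.tag, "?")


def apply_rule(parent, pattern, category):
    """Wrap every run of children matching `pattern` in a <category> element."""
    children = list(parent)
    # The input word has exactly one character per child element, so
    # string offsets in a match are also indices into `children`.
    word = "".join(symbol(c) for c in children)
    rebuilt, pos = [], 0
    for m in re.finditer(pattern, word):
        if m.start() == m.end():
            continue  # ignore empty matches
        rebuilt.extend(children[pos:m.start()])
        wrapper = ET.Element(category)
        wrapper.extend(children[m.start():m.end()])
        rebuilt.append(wrapper)
        pos = m.end()
    rebuilt.extend(children[pos:])
    for c in children:
        parent.remove(c)
    parent.extend(rebuilt)


def run_cascade(xml_text):
    """Apply two grammars in sequence to every sentence of the document."""
    root = ET.fromstring(xml_text)
    for sentence in root.findall(".//s"):        # XPath selects the context
        apply_rule(sentence, r"D?A*N+", "NP")    # level 1: noun phrases
        apply_rule(sentence, r"Pn", "PP")        # level 2: preposition + NP
    return ET.tostring(root, encoding="unicode")


SAMPLE = (
    '<text><s><w pos="D">the</w><w pos="N">dog</w><w pos="V">sleeps</w>'
    '<w pos="P">in</w><w pos="D">the</w><w pos="A">old</w>'
    '<w pos="N">house</w></s></text>'
)

if __name__ == "__main__":
    print(run_cascade(SAMPLE))
```

The second rule only works because the first one has already replaced a run of tokens with an `NP` element. That dependency between levels is what makes the grammars a cascade.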

memasysco: XML schema based metadata management system for speech corpora

The metadata management system for speech corpora “memasysco” has been developed at the Institut für Deutsche Sprache (IDS) and is applied for the first time to document the speech corpus “German Today”. memasysco is based on a data model for the documentation of speech corpora and contains two generic XML schemas that drive data capture, XML native database storage, dynamic publishing, and inf...

Publication date: 2001
